ABSTRACT

Camouflaging data by generating fake information is a
well-known obfuscation technique for protecting data privacy.
The effectiveness of this technique in protecting users’
privacy depends heavily on how closely the fake information
resembles reality, so that an adversary cannot easily filter
it out. In this paper, we focus on a very sensitive and
increasingly exposed type of data: location data. There are
two main scenarios in which fake traces are of extreme value
for preserving location privacy: publishing datasets of
location trajectories, and using location-based services.
Despite advances in protecting (location) data privacy, there
is no quantitative method to evaluate how realistic a
synthetic trace is, and how much utility and privacy it
provides in each scenario. A methodology for generating
privacy-preserving fake traces is also lacking. In this
paper, we fill this gap and propose the first statistical
metric and model to generate fake location traces such that
both the utility of data and the privacy of users are
preserved.

We build upon the fact that, although people visit
geographically distinct locations, their mobility patterns are
semantically very similar; for example, their transition
patterns across activities (e.g., working, driving, staying at
home) are similar. We define a statistical metric and propose
an algorithm that automatically discovers the hidden semantic
similarities between locations from a bag of real location
traces used as seeds, without requiring any initial semantic
annotations. We guarantee that fake traces are geographically
dissimilar to their seeds, so they do not leak sensitive
location information. We also protect contributors to seed
traces against membership attacks. Interleaving fake traces
with mobile users’ traces is a prominent location privacy
defense mechanism. We quantitatively show the effectiveness of
our methodology in protecting against localization inference
attacks while preserving the utility of sharing/publishing
traces.

1. INTRODUCTION 

Fake (dummy) information can protect privacy and security in
many different systems such as web search [11], anonymous
communications [?], [8], authentication systems [?], and
statistical analysis [17], [23]. In all these scenarios, the
main challenge and open problem is to generate
context-dependent fake information that resembles genuine
user-produced data and provides an acceptable level of utility
while enhancing the privacy of users.

In this paper, we propose a systematic approach for preserving
the privacy of location data using fake traces. We focus on
two practical scenarios: sharing location with location-based
services, and publishing location datasets, e.g., for
research. In location-based systems, users hide their true
location among fake location traces while connecting to a
server to obtain contextual information about their
whereabouts. This protects them against location inference
attacks. The benefit of the fake injection approach over other
obfuscation techniques, such as location perturbation [?], is
that it does not reduce the users’ experienced service
quality. Users only incur the overhead of filtering out the
received information about fake locations. In publishing
privacy-preserving location datasets, the purpose is to
preserve the general statistics about human mobility. There is
a utility loss associated with fake traces, as they might not
fully preserve all the characteristics of real traces. The
challenge is to generate synthetic traces that semantically
resemble the real traces yet do not leak information about the
exact geographic locations visited by any particular
individual. This gives rise to a tradeoff between utility and
privacy that is inherent to privacy-preserving systems.

There has been some preliminary work on using fake location
queries to protect users’ location privacy [13], [15], [26],
[?]. However, these approaches are based on very simple
heuristics such as i.i.d. location sampling, sampling
locations from a random walk on a grid with uniform
probability, and using road trip algorithms to generate
driving traces between two random locations. In this paper, we
quantitatively show that these methods fail to protect
location privacy against inference attacks. Moreover, these
methods lack (i) a metric that captures how realistic a
synthetic location trace is with respect to human mobility, so
that it cannot be easily detected by an attacker, and (ii) a
generative model that produces samples of synthetic yet
realistic traces according to such a metric, while preserving
utility and ensuring that the synthetic traces do not
themselves leak information about any individual. In this
paper, we present the first formal methodology to solve these
problems and to generate fake yet semantically real location
traces for protecting location privacy. We also enforce, and
quantitatively measure, privacy against location privacy
attacks.

Our scheme is based on the fact that mobility patterns of
different individuals share some semantic features, regardless
of which geographic locations they visit. These common
features of human mobility stem from people’s similar
lifestyles. The mobility patterns share a similar structure
that reflects the general behavior of a population (even at a
high level [?]). We model the mobility of each individual in
two dimensions: geographic and semantic. The geographic
features are mostly specific to each individual (hence are
sensitive), whereas the semantic features are mostly generic
and representative of human mobility behavior (hence are
useful). We extract the common semantic features of mobility
patterns and use them to generate realistic synthetic traces,
without leaking the geographic features of any individual’s
locations.

Consequently, we define two metrics to quantify the similarity
between human mobility models. The geographic similarity
metric between two individuals captures how correctly we can
predict the locations of one knowing the mobility model of the
other. This metric helps us to capture the spatiotemporal
information leakage of a fake trace about a real trace. The
semantic similarity metric reflects how well two traces match
in terms of their semantic features. We assume that we have a
dataset of real traces. We develop an algorithm that
automatically learns the semantic correlation between
locations and transitions between locations.¹ To generate
semantically realistic fake traces, we transform a (real) seed
trace into the semantic domain and probabilistically sample
(fake) location traces that are consistent with it. Thus, a
generated sample resembles a typical sequence of locations
that could have been visited by some real individual. We then
design a rejection sampling test to ensure that a fake trace
does not leak information about the locations in its seed
trace, i.e., we reject the samples that do not meet the
privacy requirements. Thereby, we protect the privacy of seed
traces against the following threats: inference attacks (to
learn which locations the seed contributors have visited), and
membership inclusion attacks (to learn whether a particular
individual with certain semantic habits has been in the seed
dataset). If a fake trace’s geographic similarity to, or its
intersection with, its real seed trace exceeds a given
threshold, we reject it and sample a new trace. We also ensure
that the semantic similarity between fake and seed traces
cannot be used against the anonymity of seed traces. To this
end, we reject the fake trace unless there are at least k − 1
alternative real traces that could have been the seed for
generating the fake (i.e., whose differential semantic
similarity with the fake trace is below a threshold). This
additionally guarantees plausible deniability for each seed
trace. The resulting pool of fake traces can later be drawn
upon for use by, e.g., users’ smartphones. The fake traces
generated from the seed database can be used to protect the
location privacy of any user (not necessarily a contributor to
the seed database). The privacy tests guarantee that we
preserve the privacy of seed traces. Additionally, we show
that the generated fakes can significantly protect the privacy
of LBS users against inference attacks.

Our Contributions: In summary, the novelty of this work is
twofold. (1) We introduce the notion of semantic similarity
for mobility patterns, and we propose both a metric for it and
an algorithm to quantify the semantic similarity between
location traces. We also automatically learn the semantic
relation between locations. (2) We propose a generative model
for fake location traces that are semantically similar to real
traces. We also guarantee plausible deniability to individuals
whose real traces are used as seeds in our algorithm. Our
software tool, given a set of real location traces, generates
fake traces based on our theoretical framework. We run our
algorithms on a real-world dataset collected by Nokia [?]. We
show the effectiveness of our fake traces in protecting the
privacy of users in two main scenarios: location-based
services and published location datasets, while preserving
utility.

¹ Note that we do not annotate the locations (as home, work,
...), nor do we use any annotation as an input to compute the
similarity between locations. We only rely on location traces
as input.

Figure 1: Sketch of the proposed scheme.


2. OUR SCHEME 

In this section, we present a sketch of our scheme for
generating fake traces and describe the main intuition behind
it. We assume that time and space are discrete, so a location
trace is represented as a sequence of visited locations over
time. In our scheme, we generate a fake trace through a
multi-step process. Figure 1 illustrates the details.


2.1 Computing the Semantic Similarity 

The first step is to compute the semantic similarity between
the locations in an area. We learn these similarity values
automatically from a dataset of real location traces. For each
trace (i.e., the sequence of locations visited by a single
individual) in the dataset, we first compute a probabilistic
mobility model that represents the visiting probability of
each location and the transition probability among the
locations (see Section 3.1). The mobility model encompasses
the spatiotemporal behavior of each individual with respect to
different locations. The time, duration, and probability
(frequency) of visiting a location, as well as the probable
origins and destinations of movements through it, are all
computable from the mobility model. It therefore implicitly
reflects the types of activities that an individual might
carry out in each location (and over a sequence of locations).

We analyze and discover the semantic relation between
different locations in a consistent manner by considering all
locations together. To this end, we propose a semantic
similarity metric (see Section 3.4). Intuitively, we assign a
higher similarity value to a pair of locations at which
different individuals have similar spatiotemporal activities.
Thus, our metric tries to find the optimal way to map the
visited locations in a pair of traces such that the mapping
maximizes the statistical similarity between their mobility
models. The semantic similarity metric is therefore the
statistical similarity between mobility models under the
optimal mapping between locations. This means that, when the
semantic similarity of a pair is high, translating the
locations of one trace according to the discovered best
mapping makes both traces follow the same mobility model. For
example, consider Alice and Bob spending all day at their
respective work locations $w_A$ and $w_B$, and all night at
their respective home locations $h_A$ and $h_B$. Obviously,
their mobility models are semantically very similar, although
it might be the case that $h_A \neq h_B$ and $w_A \neq w_B$.
In this example, the best semantic mapping between locations
will be $w_A \leftrightarrow w_B$ and $h_A \leftrightarrow h_B$.

For each pair of mobility models for traces in our dataset,
we compute their semantic similarity as well as the best
semantic mapping between their locations. We then aggregate
all the location matchings across all trace pairs, with
weights based on the semantic similarity between mobility
models, and construct a location semantic graph in which the
nodes are locations and the weight of each edge is the average
semantic similarity between the two locations over the
dataset.
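
As a rough illustration of this aggregation step, the sketch below accumulates pairwise matchings into edge weights. The input format (`pairwise`, a list of similarity/mapping tuples), the function name `build_semantic_graph`, and the choices to average over all trace pairs and to skip self-matches are assumptions made for this example, not details given in the text.

```python
from collections import defaultdict

def build_semantic_graph(pairwise):
    """Aggregate pairwise location matchings into weighted edges.

    pairwise: iterable of (similarity, mapping), one entry per pair of
    traces, where mapping is a dict {location_of_u: location_of_v}.
    Returns {frozenset({r, r2}): weight}, averaged over all trace pairs.
    """
    total = defaultdict(float)
    n_pairs = 0
    for similarity, mapping in pairwise:
        n_pairs += 1
        for r, r2 in mapping.items():
            if r != r2:  # assumption: a location matched to itself adds no edge
                total[frozenset((r, r2))] += similarity
    return {edge: weight / n_pairs for edge, weight in total.items()}

# Two trace pairs that agree on the work locations but not on the home ones.
graph = build_semantic_graph([
    (0.8, {"wA": "wB", "hA": "hB"}),
    (0.4, {"wA": "wB", "hA": "cafe"}),
])
```

In this toy input, the edge between `wA` and `wB` receives the largest weight because both trace pairs support it.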

2.2 Forming Location Semantic Classes 

The location semantic graph enables us to find which locations
have similar meanings for different people, in the sense that
people carry out similar activities in those places. Locations
that have higher semantic similarity can be grouped together
to represent one location semantic class. To this end, we run
a clustering algorithm on the location semantic graph to
partition locations into distinct classes. Locations that fall
into the same class are visited in the same way by different
people regardless of their geographic positions.² Thus, we can
consider them as being semantically equivalent. So, using the
notation of our previous example, $w_A$ and $w_B$ should
belong to the same cluster that can represent “workplace”
locations, and $h_A$ and $h_B$ should be grouped into another
cluster representing residential or “home” locations.

2.3 Generating a Fake Trace 

We use the location semantic classes as the basis to generate
fake traces. In addition to being semantically realistic, the
fake traces must be geographically consistent with the general
mobility of individuals in the considered area. For example,
the speed of movement in some locations depends on the time,
and different paths are taken with different probabilities. To
capture these patterns, we compute an aggregate mobility model
from the traces in our dataset. We can, for example, compute
this by averaging the mobility models that we constructed on
the traces.
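
A minimal sketch of the averaging option mentioned above is given here. It assumes each per-user model is a pair of dictionaries `(p, pi)`, with `p[(r, tau, tau2)]` a distribution over next locations and `pi[tau]` a distribution over locations; this layout and the function names are hypothetical, and averaging each row only over the users who have that row is a simplification of this example.

```python
from collections import defaultdict

def average_distributions(dists):
    """Average sparse distributions given as {outcome: probability} dicts."""
    total = defaultdict(float)
    for dist in dists:
        for outcome, prob in dist.items():
            total[outcome] += prob / len(dists)
    return dict(total)

def aggregate_profile(profiles):
    """Average per-user profiles (p, pi) into one aggregate profile."""
    p_rows, pi_rows = defaultdict(list), defaultdict(list)
    for p, pi in profiles:
        for key, row in p.items():
            p_rows[key].append(row)
        for tau, row in pi.items():
            pi_rows[tau].append(row)
    # Each row is averaged over the users that actually have that row.
    p_bar = {key: average_distributions(rows) for key, rows in p_rows.items()}
    pi_bar = {tau: average_distributions(rows) for tau, rows in pi_rows.items()}
    return p_bar, pi_bar

alice = ({("h", "am", "pm"): {"w": 1.0}}, {"am": {"h": 1.0}})
bob = ({("h", "am", "pm"): {"h": 1.0}}, {"am": {"h": 0.5, "w": 0.5}})
p_bar, pi_bar = aggregate_profile([alice, bob])
```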

The goal is to generate fake traces that are semantically
similar to real traces. In order to construct a fake trace,
our algorithm starts with a seed trace and converts it to a
probabilistically generated, semantically similar synthetic
trace that is consistent with the aggregate mobility model. We
pick the seed trace at random from the trace dataset. The seed
trace, like the other location traces in the database, is
composed of a sequence of geographic locations. In our
algorithm, we first transform the geographic seed trace into
the semantic domain; then we use the transformed semantic
trace to sample from the state space of all geographic traces
that could have been transformed to the same semantic trace.
The transformation and sampling procedures, which are at the
heart of this step, are done as follows.

For the seed trace transformation, we replace each geographic
location in the seed with the set of locations that are in the
same semantic class. The seed semantic trace is thus a
sequence of location sets. For the fake trace sampling, we
address the following problem. We want to construct a trace
that follows the aggregate mobility model under the constraint
that its location at each time is a member of the
corresponding location set in the seed semantic trace. Hence,
both the fake trace and the seed trace can be transformed to
the same semantic trace. We add some randomness to the
locations in the semantic trace to allow a higher number of
possible fake traces. Many algorithms can be used to sample a
fake trace that satisfies our constraints. We make use of
dynamic programming algorithms that construct the traces
efficiently (Section 4).

Symbol                    Meaning
$r$                       A location
$d(\cdot)$                A distance function (between locations)
$M_d(p, q)$               Mallows distance between $p$ and $q$ based on a distance function $d(\cdot)$
$\sigma$                  A permutation function
$\mathrm{sim}_G(u, v)$    Geographic similarity between mobility of $u$ and $v$
$\mathrm{sim}_S(u, v)$    Semantic similarity between mobility of $u$ and $v$
$\sigma_u^v$              Optimal semantic mapping between locations of $u$ and $v$
$S$                       Set of real traces used as seeds to generate fake traces
$(\bar{p}, \bar{\pi})$    Aggregate mobility model
$C$                       A partition on $\mathcal{R}$, representing location semantic classes; $C_i$ is the set of locations in class (partition) $i$
$\mathcal{F}$             A set of fake locations generated from $S$

Table 1: Table of notations

² Their visit probability, time of visit, and the probabilities
of transition from/to them to/from other locations of the same
type are similar.
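
The constrained sampling step of this section can be illustrated with a small forward-backward sketch. It is one possible dynamic program, not necessarily the one used by the scheme. The aggregate model is simplified here to a time-homogeneous chain (`init`, `trans`), and `allowed[t]` stands for the location set of the seed semantic trace at time `t`; all names are assumptions of this example.

```python
import random

def sample_constrained_trace(init, trans, allowed, rng=random):
    """Sample a trace from a Markov chain conditioned on x_t in allowed[t].

    init: {r: prob}; trans: {r: {r2: prob}}; allowed: non-empty list of
    sets of locations, one set per time instant.
    """
    n = len(allowed)
    # Backward pass: beta[t][r] = Pr(all later constraints hold | x_t = r).
    beta = [dict() for _ in range(n)]
    beta[n - 1] = {r: 1.0 for r in allowed[n - 1]}
    for t in range(n - 2, -1, -1):
        for r in allowed[t]:
            beta[t][r] = sum(trans.get(r, {}).get(r2, 0.0) * b
                             for r2, b in beta[t + 1].items())

    def draw(weights):
        items = [(r, w) for r, w in weights.items() if w > 0]
        if not items:
            raise ValueError("no trace satisfies the constraints")
        locations, ws = zip(*items)
        return rng.choices(locations, weights=ws)[0]

    # Forward pass: sample each location given the previous one, weighting
    # every candidate by the probability that the rest can still be completed.
    trace = [draw({r: init.get(r, 0.0) * b for r, b in beta[0].items()})]
    for t in range(1, n):
        prev = trace[-1]
        trace.append(draw({r: trans.get(prev, {}).get(r, 0.0) * b
                           for r, b in beta[t].items()}))
    return trace
```

Because the backward pass removes dead ends in advance, every sampled prefix can be completed, so no candidate trace has to be discarded for violating the semantic constraint.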

We can repeatedly generate fake traces from each seed trace in
the dataset, each with a probability determined by the
aggregate mobility model. After generating each fake trace,
however, we need to make sure that it is not geographically
similar to the seed trace, because we do not want to leak
information about the real seed trace. To this end, we compute
the geographic similarity (see Section 3.3) between the seed
trace and the fake trace, and reject the sampled traces whose
similarity to the seed trace exceeds a threshold. Thus, we
make sure that the semantically similar synthetic traces are
indeed geographically dissimilar to the traces in our dataset,
hence not leaking information about visited locations in the
real traces.
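
The accept/reject loop described above can be sketched as follows. The callables `sample_fake` and `geo_similarity`, the `max_tries` cap, and the function name are placeholders introduced for this example; in the scheme itself the similarity would be the geographic similarity metric of Section 3.3.

```python
def generate_fakes(seed, sample_fake, geo_similarity, threshold, count,
                   max_tries=1000):
    """Collect up to `count` fakes whose similarity to `seed` is acceptable.

    sample_fake(seed) returns a candidate trace and geo_similarity(seed, fake)
    returns a similarity score; both are hypothetical stand-ins.
    """
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == count:
            break
        fake = sample_fake(seed)
        # Reject candidates that are more similar to the seed than allowed.
        if geo_similarity(seed, fake) <= threshold:
            accepted.append(fake)
    return accepted
```

The `max_tries` bound is a practical safeguard: if the threshold is too strict for a given seed, the loop returns fewer traces instead of running forever.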


3. MOBILITY SIMILARITY METRICS 

In this section, we present a probabilistic model for
mobility, and propose two metrics to analyze the geographic
and semantic similarity between two mobility models. Table 1
presents the list of notations that we use in this paper.

3.1 Mobility Model 

We model the user mobility as a time-dependent first-order
Markov chain on the set of regions (locations). As users have
different behavior and mobility patterns during different
periods of time, we assume that time is partitioned into time
periods, e.g., morning, afternoon, evening, and night. The
mobility profile $(p(u), \pi(u))$ of a given user $u$ consists
of the transition probability matrix of the Markov chain
associated with the user’s mobility (from one region to
another), and the user’s visiting probability distribution
over the regions, respectively. Note that these probabilities
are dependent on each other, and together they constitute the
joint probability of two regions that are subsequently visited
by the user. The entry $p_{r,\tau,\tau'}^{r'}(u)$ of $p(u)$ is
the probability that user $u$ will move to region $r'$ in the
next time instant (which will be in time period $\tau'$),
given that she is now (in time period $\tau$) in region $r$.
The entry $\pi_r^{\tau}(u)$ of $\pi(u)$ is the probability
that user $u$ is in region $r$ in time period $\tau$. Let the
random variable $A_u^t$ represent the actual location of user
$u$ at time $t$, and let $T^t$ be the time period associated
with time $t$. So, the mobility profile of a given user $u$
consists of the following probabilities:

$$
\begin{aligned}
p_{r,\tau,\tau'}^{r'}(u) &= \Pr\{A_u^{t+1} = r' \mid A_u^t = r;\ T^t = \tau,\ T^{t+1} = \tau'\}, \\
\pi_r^{\tau}(u) &= \Pr\{A_u^t = r;\ T^t = \tau\}.
\end{aligned}
\tag{1}
$$

This Markovian model can predict the location of an individual
to a great extent, as it takes both the location and time
aspects into account. It can become even more precise by
increasing its order or by enriching its state. We can, for
example, include multiple granularities of locations, and
model the mobility of a user on the set of, e.g., (location,
neighborhood) pairs in addition to the time dimension. Our
framework can incorporate all these new dimensions in the same
way that we model the time periods. To learn the probabilities
of the mobility profile $(p(u), \pi(u))$ from location traces,
we can use maximum likelihood estimation (if the traces are
complete) or make use of algorithms such as Gibbs sampling (if
the traces have missing locations or are noisy) [27].
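
For complete traces, maximum likelihood estimation amounts to counting and normalizing. The sketch below illustrates this under stated assumptions: `period_of` is a hypothetical function mapping a time instant to its time period, and the visiting probabilities are normalized within each period (one possible reading of the visiting distribution).

```python
from collections import Counter, defaultdict

def estimate_profile(trace, period_of):
    """Count-based estimate of a mobility profile (p, pi) from one trace.

    trace: list of locations, one per time instant (no gaps).
    Returns p[(r, tau, tau2)] = {r2: prob} and pi[tau] = {r: prob}.
    """
    visits = defaultdict(Counter)  # visits[tau][r]
    moves = defaultdict(Counter)   # moves[(r, tau, tau2)][r2]
    for t, r in enumerate(trace):
        tau = period_of(t)
        visits[tau][r] += 1
        if t + 1 < len(trace):
            moves[(r, tau, period_of(t + 1))][trace[t + 1]] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    pi = {tau: normalize(counter) for tau, counter in visits.items()}
    p = {key: normalize(counter) for key, counter in moves.items()}
    return p, pi

# Instants 2 and 3 fall in the "day" period, all others in "night".
p, pi = estimate_profile(["home", "home", "work", "work", "home", "home"],
                         lambda t: "day" if t in (2, 3) else "night")
```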

3.2 Mobility Similarity Metrics 

We propose two metrics to compare the mobility of two users
and compute their similarities: geographic and semantic
similarity. In this subsection, we describe the intuition
behind these metrics, and in the following subsections, we
formally define them and provide the algorithms to compute
them.

The geographic similarity metric captures the correlation 
between location traces that are generated by two mobility 
profiles. It reflects if two users visit similar locations over 
time with similar probabilities and if they move between 
those locations also with similar probabilities. Under this
metric, for example, two individuals who live in the same
region A and work in the same region B potentially have very
similar mobilities, as they spend their work hours in B and
most of their free time in A.

There are very few people whose mobility patterns have a high
geographic similarity with each other. However, if we ignore
the exact locations that are visited by different people, we
observe that they share similar patterns for visiting
locations with similar semantics (locations in which they have
similar activities). For example, most people visit and stay
at a single location from each evening until the subsequent
morning. These locations differ from one individual to
another, but have the same semantics for them: home.

One can imagine the semantic dimension of locations as a
coloring of the locations on a map. Instead of computing the
correlation between location traces at the geographic level,
we can also compute such correlation at the semantic level (by
reducing the set of locations to the set of colors and
computing the similarity of color traces). This is the
intuition behind our semantic similarity metric. In this case,
if the pairs of locations that two individuals visit over time
have the same semantics, their mobility models are also
semantically similar (even if they are in two different
cities, i.e., have no geographic similarity). Hence, if we
transform trace A by replacing its locations with their
corresponding semantically similar locations in trace B, the
transformed trace becomes geographically similar to B. So, two
traces are semantically similar if their locations can be
mapped to each other such that one trace can be transformed
into a trace that is geographically similar to the other.

3.3 Geographic Similarity Metric 

We define this similarity metric based on the Earth Mover’s
Distance (EMD) for probability distributions. The EMD is
widely used in a range of applications including image
processing [?]. The EMD can be understood by thinking of the
two distributions as piles of dirt. In this interpretation,
the EMD represents the minimum amount of work needed to turn
one pile of dirt (i.e., one distribution) into the other, the
cost of moving dirt being proportional to both the amount of
dirt and the distance to the destination. The special case of
EMD for probability distributions has been shown to be
equivalent to the Mallows distance [?].

Let $X$ and $Y$ be discrete random variables with probability
distributions $p$ and $q$, such that $\Pr\{X = x_i\} = p_i$
and $\Pr\{Y = y_i\} = q_i$, respectively, for
$i = 1, 2, \ldots, n$. We also have
$\sum_i p_i = \sum_i q_i = 1$.

Definition 1. (From [?]) Let $d(\cdot)$ be an arbitrary
distance function between $X$ and $Y$. The Mallows distance
$M_d(p, q)$ is defined as the minimum expected distance
between $X$ and $Y$ with respect to $d(\cdot)$, taken over all
joint distribution functions $f$ for $(X, Y)$ such that $p$
and $q$ are the marginal distributions of $X$ and $Y$,
respectively:

$$M_d(p, q) := \min_f \{ E_f\{d(X, Y)\} : (X, Y) \sim f,\ X \sim p,\ Y \sim q \}, \tag{2}$$

where the expectation, minimized under $f$, is

$$E_f\{d(X, Y)\} = \sum_{i=1}^{n} \sum_{j=1}^{n} f_{ij}\, d(x_i, y_j). \tag{3}$$

In addition to the constraints
$\sum_{i=1}^{n} \sum_{j=1}^{n} f_{ij} = 1$ and $f_{ij} \geq 0$
for all $i, j$, the joint probability distribution function
$f$ must also satisfy $\sum_{j=1}^{n} f_{ij} = p_i$ and
$\sum_{i=1}^{n} f_{ij} = q_j$.

Note that, for given $p$ and $q$, the minimizing $f$ is easily
computed by expressing the optimization problem as a linear
program.

Using the previous definition, we define the geographic
similarity metric based on the Mallows distance.

Definition 2. Let $d(\cdot)$ be an arbitrary distance
function. The dissimilarity between two mobility profiles
$(p(u), \pi(u))$ and $(p(v), \pi(v))$ (belonging to
individuals $u$ and $v$) is defined as the expected Mallows
distance of the next random locations $r'$ and $r''$ according
to the mobility profiles of $u$ and $v$, respectively. More
formally, it is

$$E_{(r,\tau,\tau')}\{ M_d( p_{r,\tau,\tau'}(u),\ p_{r,\tau,\tau'}(v) ) \}, \tag{4}$$

where $p_{r,\tau,\tau'}(u)$ and $p_{r,\tau,\tau'}(v)$ denote
the conditional probability distributions of the next
location, given the current location and the current and next
time periods. The Mallows distance is computed over random
variables $r'$ and $r''$, and the expectation is computed over
random variable $r$ and time periods $\tau$ and $\tau'$.


We define the geographic similarity between mobility patterns
of $u$ and $v$ as

$$\mathrm{sim}_G(u, v) := 1 - E\{ M_d( p_{r,\tau,\tau'}(u),\ p_{r,\tau,\tau'}(v) ) \}. \tag{5}$$

We compute the geographic dissimilarity using the law of total
expectation. This also clarifies its meaning by showing more
directly the role of the random variables.

$$E\{ M_d( p_{r,\tau,\tau'}(u),\ p_{r,\tau,\tau'}(v) ) \} = \sum_{r,\tau,\tau'} M_d( p_{r,\tau,\tau'}(u),\ p_{r,\tau,\tau'}(v) ) \cdot P^{r,\tau,\tau'}(u). \tag{6}$$

Here $P^{r,\tau,\tau'}(u)$ denotes the probability, under
$u$’s mobility profile, of being at location $r$ in time
period $\tau$ with next time period $\tau'$. This is simply
the average, over each time and location, of the EMD between
the distributions of the next location of $u$ and $v$. So, for
each current location (and time), we use the EMD to compute
the dissimilarity between the distributions representing the
next locations of users $u$ and $v$, respectively. The current
location is taken according to user $u$’s mobility profile,
making this definition asymmetric.

For a particular distance function $d(\cdot)$, the Mallows
distance definition can be expanded and the previous
expressions can be further simplified. This is the case for
$d(i, j) := 1_{i \neq j}$, for which $M_d(p, q)$, for
arbitrary probability distributions $p$ and $q$, has the
closed form $1 - \sum_i \min\{p_i, q_i\}$.
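
To see why the closed form $1 - \sum_i \min\{p_i, q_i\}$ holds for the indicator distance, the following short derivation uses only the marginal constraints of Definition 1. It is included as a reading aid rather than as part of the original argument.

```latex
% Under d(i,j) = 1_{i \neq j}, only the off-diagonal mass of f is charged:
E_f\{d(X,Y)\} = \sum_{i \neq j} f_{ij} = 1 - \sum_i f_{ii}.
% The marginals force f_{ii} \le \sum_j f_{ij} = p_i and
% f_{ii} \le \sum_j f_{ji} = q_i, hence f_{ii} \le \min\{p_i, q_i\} and
E_f\{d(X,Y)\} \ge 1 - \sum_i \min\{p_i, q_i\}.
% The bound is attained by choosing f_{ii} = \min\{p_i, q_i\} and spreading
% the leftover mass over off-diagonal entries that respect the marginals.
% Example: p = (0.7, 0.3, 0), q = (0.2, 0.5, 0.3) gives
M_d(p, q) = 1 - (0.2 + 0.3 + 0) = 0.5.
```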

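Using the dissimilarity metric, we can compute the geographic
similarity between the mobility profiles $(p(u), \pi(u))$ and
$(p(v), \pi(v))$ for any distance function (e.g., Hamming
distance, Euclidean distance). For example, considering the
Hamming distance $d(r, r') = 1_{r \neq r'}$, the geographic
similarity is:

$$
\begin{aligned}
\mathrm{sim}_G(u, v) &= 1 - \sum_{r,\tau,\tau'} P^{r,\tau,\tau'}(u) \Big( 1 - \sum_{r'} \min\{ p_{r,\tau,\tau'}^{r'}(u),\ p_{r,\tau,\tau'}^{r'}(v) \} \Big) \\
&= \sum_{r,r',\tau,\tau'} p_{\tau}^{\tau'}(u)\, \pi_r^{\tau}(u)\, \min\{ p_{r,\tau,\tau'}^{r'}(u),\ p_{r,\tau,\tau'}^{r'}(v) \},
\end{aligned}
\tag{7}
$$

where $P^{r,\tau,\tau'}(u) = p_{\tau}^{\tau'}(u)\, \pi_r^{\tau}(u)$
is the probability, under $u$’s mobility profile, of being at
location $r$ in time period $\tau$ with next time period
$\tau'$.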

We emphasize that this definition leads to an asymmetric
similarity measure, i.e., the similarity of $v$ to $u$ need
not be the same as the similarity of $u$ to $v$. In principle,
this metric can also be computed using measures other than
EMD. For example, one can use the Kullback-Leibler divergence
to compute the difference between two probability
distributions, ignoring the distance between the locations. We
use EMD in our geographic similarity metric because we also
want to include the distance function $d(\cdot)$ between
locations while computing the difference between two
distributions (i.e., mobility models).

Consider now the computation of the geographic similarity. For
the case $d(r, r') = 1_{r \neq r'}$, the computation according
to the closed form of (7) takes $O(T^2 \cdot R^2)$ operations,
where $T$ is the number of time periods and $R$ is the number
of locations (regions). For arbitrary $d(\cdot)$ with no
closed-form expression, the geographic similarity is obtained
through $T^2 \cdot R$ EMD computations. Each of these EMD
computations involves minimizing the Mallows distance, which
is equivalent to solving the linear program given by (2).
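
For the Hamming-distance case, the computation reduces to a weighted sum of overlaps between next-location distributions. The sketch below illustrates this with dictionary-based profiles; the data layout, the `weight_u` argument (the probability of each current state under $u$’s profile), and the function name are assumptions of this example.

```python
def geographic_similarity(p_u, p_v, weight_u):
    """Geographic similarity of v to u under the Hamming distance.

    p_u, p_v: {(r, tau, tau2): {r_next: prob}} next-location distributions.
    weight_u: {(r, tau, tau2): prob}, u's probability of being at r in
    period tau with next period tau2; the weights should sum to 1.
    """
    similarity = 0.0
    for key, weight in weight_u.items():
        row_u = p_u.get(key, {})
        row_v = p_v.get(key, {})
        # Overlap of the two distributions equals 1 minus their Hamming EMD.
        overlap = sum(min(prob, row_v.get(r_next, 0.0))
                      for r_next, prob in row_u.items())
        similarity += weight * overlap
    return similarity

p_u = {("A", "day", "day"): {"A": 0.5, "B": 0.5}}
p_v = {("A", "day", "day"): {"A": 1.0}}
weight_u = {("A", "day", "day"): 1.0}
sim_uv = geographic_similarity(p_u, p_v, weight_u)
```

Since the weights come only from $u$’s profile, swapping the roles of the two users generally gives a different value, which matches the asymmetry noted in the text.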

3.4 Semantic Similarity Metric 

The semantic similarity metric builds upon the basic
assumption that for two individuals $u$ and $v$ there exists
an (unknown) semantic mapping $\sigma$ of locations
$\mathcal{R}$ onto itself (i.e., a permutation) such that
$\mathcal{R}$, for $u$, and $\sigma^{-1}(\mathcal{R})$, for
$v$, semantically match. It is important to note that assuming
such a mapping does not commit us to trying to learn it based
on modeling location semantics directly. Instead, we define
the (hidden) semantic similarity between $u$ and $v$ as the
maximum geographic similarity taken over all possible mappings
$\sigma$. We define the semantic similarity metric as follows.

Definition 3. Let $\sigma$ be the mapping of locations of $u$
to locations of $v$. Let $r$, $r'$, and $r''$ be random
variables for locations, and $\tau$ and $\tau'$ be two time
periods. We define the semantic dissimilarity between $u$ and
$v$ for moving in the sequence of time periods
$\{\tau, \tau'\}$ as

$$D_u^v(\{\tau, \tau'\}) := \min_{\sigma} E\big[ M_d\big( p_{r,\tau,\tau'}(u),\ p^{\sigma}_{\sigma(r),\tau,\tau'}(v) \big) \big], \tag{8}$$

where $p^{\sigma}_{\sigma(r),\tau,\tau'}(v)$ is $v$’s
next-location distribution from $\sigma(r)$ with locations
relabeled through $\sigma$ (it assigns probability
$p^{\sigma(r'')}_{\sigma(r),\tau,\tau'}(v)$ to location
$r''$), the Mallows distance $M_d(\cdot)$ is computed over the
random variable $r'$, and the expectation is computed over the
random variable $r$ given time periods $\tau$ and $\tau'$.

Now, we define the semantic similarity between $u$ and $v$
over any sequence of time periods as

$$\mathrm{sim}_S(u, v) := 1 - E\big[ D_u^v(\{\tau, \tau'\}) \big]. \tag{9}$$

What we compute in (8) is the minimum geographic mobility
dissimilarity between $u$ and $v$ where the locations of $v$
are relabeled and mapped to locations of $u$ according to the
permutation function $\sigma_u^v$ (which is the $\sigma$ that
minimizes (8)). The intuition is the following. Consider two
individuals $u$ and $v$ who are at $r$ and $\sigma(r)$,
respectively, in time period $\tau$. The Mallows distance
$M_d$ computes how dissimilar their movements to the next
location will be, where the next locations are represented by
random variables $r'$ for $u$ and $\sigma(r'')$ for $v$. If,
according to a mapping, the way that they move between these
locations is similar, they behave similarly with respect to
those locations. If this is true across different time periods
and location pairs, their mobilities are similar. So, the
semantic similarity between two individuals is determined by
$\sigma_u^v$.

We compute this metric at two different levels of accuracy of
the mobility model. If we only consider the visiting
probability $\pi$ part of each individual’s mobility profile,
we compute $\mathrm{sim}_S$ as follows. Let us consider the
Hamming distance function $d(r, r') = 1_{r \neq r'}$. In this
case, we can compute the semantic similarity metric as

$$\mathrm{sim}_S(u, v) = 1 - \sum_{\tau} \Pr\{\tau\} \Big( 1 - \max_{\sigma} \sum_{r} \min\{ \pi_r^{\tau}(u),\ \pi_{\sigma(r)}^{\tau}(v) \} \Big). \tag{10}$$


Note that the computation of (10) requires finding the mapping
$\sigma$ that maximizes the inner term for each time period
$\tau$. Since there are $R!$ possible candidates for the
maximizing mapping $\sigma$, a brute-force approach is
inefficient. However, the problem’s structure resembles that
of a linear assignment. Focusing on the inner sum, we see that
each term (each $r$) can be associated with $R$ values of
$\sigma(r)$ independently of the other components of $\sigma$.
To recast the problem as a linear assignment, we construct a
bipartite graph where the nodes represent $\mathcal{R}$ and
$\sigma(\mathcal{R})$, and each edge represents the
association (through $\sigma$) of $r$ with $\sigma(r)$ and
carries the weight
$\min\{ \pi_r^{\tau}(u), \pi_{\sigma(r)}^{\tau}(v) \}$. The
maximum weight assignment of the constructed bipartite graph
gives the permutation $\sigma$. The running time of this
procedure is $O(T \cdot R^3)$ using the Hungarian algorithm
[20].
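
To make the objective of (10) concrete, the sketch below evaluates it by exhaustive search over permutations, which is only feasible for a handful of locations. It deliberately uses brute force instead of the Hungarian algorithm mentioned in the text, so that the quantity being maximized is easy to see. The dictionary layout and the function name are assumptions of this example.

```python
from itertools import permutations

def semantic_similarity_visits(pi_u, pi_v, period_prob):
    """Visiting-probability semantic similarity by brute force.

    pi_u, pi_v: {tau: {r: prob}}; period_prob: {tau: Pr{tau}}.
    Returns (similarity, {tau: best mapping from u's to v's locations}).
    """
    locations = sorted({r for pi in (pi_u, pi_v)
                        for row in pi.values() for r in row})
    similarity, best_maps = 0.0, {}
    for tau, weight in period_prob.items():
        row_u, row_v = pi_u.get(tau, {}), pi_v.get(tau, {})
        best_score, best_perm = -1.0, None
        for perm in permutations(locations):  # R! candidates
            score = sum(min(row_u.get(r, 0.0), row_v.get(s, 0.0))
                        for r, s in zip(locations, perm))
            if score > best_score:
                best_score, best_perm = score, perm
        similarity += weight * best_score
        best_maps[tau] = dict(zip(locations, best_perm))
    return similarity, best_maps

# Alice and Bob from Section 2.1: same routine, different places.
alice = {"day": {"work_a": 1.0}, "night": {"home_a": 1.0}}
bob = {"day": {"work_b": 1.0}, "night": {"home_b": 1.0}}
sim, maps = semantic_similarity_visits(alice, bob, {"day": 0.5, "night": 0.5})
```

For realistic numbers of locations, the inner loop over permutations would be replaced by a linear assignment solver, as the text describes.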

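We compute the semantic similarity for the case where we
consider the more accurate mobility profile $(p, \pi)$ as
follows.

$$\mathrm{sim}_S(u, v) = 1 - \sum_{\tau,\tau'} \Pr\{\tau, \tau'\} \Big( 1 - \max_{\sigma} \sum_{r,r'} \pi_r^{\tau}(u)\, \min\{ p_{r,\tau,\tau'}^{r'}(u),\ p_{\sigma(r),\tau,\tau'}^{\sigma(r')}(v) \} \Big). \tag{11}$$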




It is not known whether there is an efficient algorithm to
compute the semantic similarity according to (11). The
difficulty comes from having to consider assignments of pairs:
$(r, r')$ to $(\sigma(r), \sigma(r'))$, which makes this computation
resemble the Quadratic Assignment Problem (QAP), known
to be NP-hard and APX-hard. The semantic similarity
can nevertheless be approximated using techniques such as
Simulated Annealing or the Metropolis-Hastings algorithm.
We use this algorithm to compute the semantic similarity
metric, in the case of considering both visiting and transition
probabilities of the individuals' mobility models (see for
details). The idea is to find good approximations to the
quantity of interest (for us $\sigma$) through probabilistic local
exploration of the solution space. At each step, we replace our
current permutation with a new solution randomly selected
from its neighbors (e.g., a permutation which differs in two
positions). The output of the algorithm is the best permutation
found so far when the algorithm terminates (after some fixed
number of iterations). It is known that the starting
permutation can have an impact on the quality of the output.
In our case, we expect the permutation found during the
computation of (10) to be a good starting point.
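The local exploration described above can be sketched in Python
as follows. This is a hedged illustration rather than the authors'
code: `local_search` and its parameters are hypothetical names,
`score` is a caller-supplied function standing in for the quantity
in (11) (higher is better), and the Metropolis-style acceptance
rule with a fixed `temperature` is an assumption, since the text
does not specify the acceptance rule or a cooling schedule.

```python
import math
import random


def local_search(score, start, iterations=1000, temperature=0.05, seed=0):
    """Randomized local search over permutations; returns the best one seen.

    Requires a permutation of length at least 2.
    """
    rng = random.Random(seed)
    current = list(start)
    current_score = score(current)
    best, best_score = list(current), current_score
    for _ in range(iterations):
        # Neighbor: a permutation that differs in exactly two positions.
        i, j = rng.sample(range(len(current)), 2)
        candidate = list(current)
        candidate[i], candidate[j] = candidate[j], candidate[i]
        candidate_score = score(candidate)
        # Assumed acceptance rule: always accept improvements, accept
        # a worse neighbor with probability exp(gain / temperature).
        gain = candidate_score - current_score
        if gain >= 0 or rng.random() < math.exp(gain / temperature):
            current, current_score = candidate, candidate_score
        # Keep track of the best permutation found so far.
        if current_score > best_score:
            best, best_score = list(current), current_score
    return best, best_score
```

Passing the permutation obtained from (10) as `start` matches the
suggestion that it is a good starting point.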


4. GENERATING FAKE TRACES 

In this section, we present the details of our algorithms for 
sampling fake traces. Figure presents a high-level view. 
The processes of generating and using fake traces are
completely separate. Once a set of fake traces is generated,
it can be used in any protection mechanism.


4.1 Transform Traces into Semantic Domain 

We assume that we have a dataset $\mathcal{S}$ of real traces that
we use as seeds to generate fake traces. Each seed trace in
the dataset comes from a different individual. Generating
a fake trace starts by transforming a real trace (taken as
seed) into a semantic trace. To this end, we need to know
the semantic coordinates of the seed trace. We compute the
semantic similarity between all locations in $\mathcal{R}$, and create a
location semantic graph $G(\mathcal{R}, E, w)$ such that the vertices
are in $\mathcal{R}$ and the weight $w_G(r, r')$ on the edge between
locations $r$ and $r'$ is the weighted sum of the number of pairs
of users $u$ and $v$ for whom $r$ and $r'$ are semantically mapped
(i.e., $r = \sigma(r')$ under the mapping $\sigma$ computed for $u$ and
$v$), weighted according to their similarity. Then,
we create the equivalent semantic classes $\mathcal{C}$ by running a
clustering algorithm on this graph. For this purpose, we
use the k-means clustering algorithm, and we choose
the number of clusters such that it optimizes the clustering
objective. We present a sketch of this algorithm in
Figure 2, SemanticClustering().
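The construction of the edge weights can be sketched as follows.
This is an illustrative Python sketch under stated assumptions:
`build_semantic_graph` is a hypothetical name, each user pair is
assumed to be summarized by its similarity and its mapping `sigma`
(a list where `sigma[r]` is the location matched to `r`), the graph
is treated as undirected, and locations mapped to themselves are
skipped. The clustering step itself is not shown.

```python
from collections import defaultdict


def build_semantic_graph(pairwise):
    """Accumulate edge weights of the location semantic graph.

    pairwise: iterable of (similarity, sigma), one entry per user pair.
    Returns a dict mapping an undirected edge (r, r2), r < r2, to its weight.
    """
    weights = defaultdict(float)
    for similarity, sigma in pairwise:
        for r, mapped in enumerate(sigma):
            if r == mapped:
                continue  # assumption: no self-loops in the graph
            edge = (min(r, mapped), max(r, mapped))
            # Each mapped pair contributes the similarity of the user pair.
            weights[edge] += similarity
    return dict(weights)
```

The resulting weights can then be handed to any graph or
similarity-based clustering routine.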

We then convert the seed location trace $seed$ into its
corresponding semantic trace $semseed$ by simply replacing each
location in the trace with all its semantically equivalent
locations (according to the semantic classes $\mathcal{C}$). Figure 3
depicts an example of such a semantic seed. Intuitively, this
composite trace encompasses all possible geographic traces that
have a high semantic similarity to the original seed trace. To
be more flexible with respect to the traces that we can
generate, we add some randomness to the semantic seed trace. In
the transformation process of the seed trace into the seman- 


SemanticSimilarity($u$, $v$)
    Compute mobility models $(p(u), \pi(u))$ and $(p(v), \pi(v))$
    Compute optimal mapping $\sigma$ from (10) or (11)
    Compute semantic similarity $\mathrm{sim}_S(u, v)$ from (10) or (11)
    Return $\mathrm{sim}_S(u, v)$, $\sigma$

SemanticClustering($\mathcal{R}$, $\mathcal{S}$, $k$)
    Initiate weighted graph $G$ with locations $\mathcal{R}$ as vertices
    Forall $u, v \in \mathcal{S}$, $u \neq v$:
        Set $(\mathrm{sim}, \sigma) \leftarrow$ SemanticSimilarity($u$, $v$)
        Forall locations $r, r' \in \mathcal{R}$ such that $r' = \sigma(r)$:
            Set edge weight $w_G(r, r') \leftarrow w_G(r, r') + \mathrm{sim}$
    Set $\mathcal{C} \leftarrow$ K-Means($G$, $k$)
    Return $\mathcal{C}$

PrivacyTest($fake$, $seed$, $par_s$, $par_i$, $par_d$)
    Set geographic similarity $sim \leftarrow \mathrm{sim}_G(fake, seed)$
    Set intersection between fake and seed $int \leftarrow$ Intersection($fake$, $seed$)
    Set $dp \leftarrow$ TRUE if $\exists \{s_i\}_{i=1}^{k-1}$ such that $|\mathrm{sim}_S(s, f) - \mathrm{sim}_S(s_i, f)| < par_d$
    Return TRUE if $int \leq par_i$ and $sim \leq par_s$ and $dp$

GenerateFake($\mathcal{R}$, $\mathcal{S}$, $k$, $pars$)
    Compute aggregate mobility model $(p, \pi)$ by averaging
        $(p(u), \pi(u))$ over all $u \in \mathcal{S}$
    Set $\mathcal{C} \leftarrow$ SemanticClustering($\mathcal{R}$, $\mathcal{S}$, $k$)
    Forall $seed \in \mathcal{S}$:
        Set $\mathcal{C}' \leftarrow \mathcal{C}$
        Update $\mathcal{C}'$ by removing each location in any partition
            with probability $par_c$
        Set semantic seed $semseed \leftarrow seed$
        Update $semseed$ by replacing all locations $r$ in the trace
            with the set of locations in their corresponding
            clusters, i.e., replace $r$ with $\mathcal{C}'_i$ where $r \in \mathcal{C}'_i$
        Update $semseed$ by removing the location $r = seed(t)$ at
            any time $t$ with probability $par_l$
        Update $semseed$ by merging locations that are located
            with time distance $\Delta t$ with probability $par_m^{\Delta t}$
        Set $fake \leftarrow$ HMMDecode($semseed$, $(p, \pi)$)
        If PrivacyTest($fake$, $seed$, $par_s$, $par_i$, $par_d$):
            Set $\mathcal{F} \leftarrow \mathcal{F} \cup fake$
    Return $\mathcal{F}$


Figure 2: Fake traces generation algorithm. We present it 
simplified for the case with a single time period. 


tic trace, we sub-sample locations from the semantic classes 
as opposed to using them all. For privacy reasons, we
remove each location in a cluster with probability $par_c$. The
result is a new cluster $\mathcal{C}'$. We also allow locations of different
classes to merge into each other close to the time instants
where the user moves from one class to the other. We
implement this by merging a location between two clusters visited
$\Delta t$ time instants apart with a geometric probability $par_m^{\Delta t}$.
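The transformation into a randomized semantic seed can be
sketched as follows. This Python sketch is one possible reading of
the description, not the authors' implementation: the function
name `make_semantic_seed` is hypothetical, semantic classes are
assumed to be given as a list of sets, and the merging rule
(a location of a different class visited $\Delta t$ instants away
is added with probability $par_m^{\Delta t}$) is an assumed
interpretation of the text.

```python
import random


def make_semantic_seed(seed, clusters, par_c, par_l, par_m, rng=None):
    """Turn a seed trace (list of locations) into a list of candidate sets."""
    rng = rng or random.Random()
    cluster_of = {r: i for i, c in enumerate(clusters) for r in c}
    # Sub-sample each semantic class: drop each location w.p. par_c.
    sampled = [{r for r in sorted(c) if rng.random() >= par_c} for c in clusters]
    # Replace each visited location by its sub-sampled class.
    semseed = [set(sampled[cluster_of[r]]) for r in seed]
    # Remove the true seed location at each time instant w.p. par_l.
    for t, r in enumerate(seed):
        if rng.random() < par_l:
            semseed[t].discard(r)
    # Assumed merging rule: candidates from a different class visited
    # dt instants away are merged in with probability par_m ** dt.
    merged = [set(s) for s in semseed]
    for t in range(len(seed)):
        for t2 in range(len(seed)):
            dt = abs(t - t2)
            if dt == 0 or cluster_of[seed[t]] == cluster_of[seed[t2]]:
                continue
            for r in sorted(semseed[t2]):
                if rng.random() < par_m ** dt:
                    merged[t].add(r)
    return merged
```

A candidate set can end up empty after sub-sampling and removal;
a caller would need to handle that case before decoding.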










4.2 Sample a Trace from the Semantic Domain 

Any random walk on the semantic seed trace that crosses 
the available locations at each time instant is a valid location 
trace that is semantically similar to the seed trace. However, 
the synthetic traces we want to generate also need to be 
geographically consistent with the general mobility of people 
in the considered area. 

We cast the problem of sampling such traces as a decoding
problem in Hidden Markov Models (HMMs). The symbols
are locations, the observables are the semantic classes
(or the set of semantically equivalent locations in the same
class), and the transition probability matrix is our aggregate
mobility model. We construct the aggregate mobility model
by averaging over the mobility models of all traces in dataset
$\mathcal{S}$, as well as giving a small probability to the possible
movements between locations according to their distance and
connectivity. More precisely, we compute the aggregate
transition probability from $r$ to $r'$ by combining the users'
transition probabilities with a distance-dependent term built
from $\epsilon$ and $\max(1, d(r, r'))$, where
$\epsilon$ is a small constant, and $Z_r$ is the normalizing factor. We
compute the aggregate visiting probability as the average
of $\pi(u)$, for all $u \in \mathcal{S}$. The probability distribution $\pi$ is also
the steady state probability distribution of $p$.

By decoding the semantic trace into geographic traces
using HMMs, we generate traces that are probable according
to aggregate mobility models, i.e., there could be one
individual who prefers to take that trajectory. There are
different HMM decoding algorithms. We make use of the Viterbi
algorithm, which is a dynamic programming algorithm to
generate the most probable trace given the observation (i.e.,
the semantic seed trace). More precisely, Viterbi finds

$$\arg\max_{fake} \Pr\{fake \mid semseed(t), (p, \pi)\}$$

assuming that $fake(t)$ can only choose from locations in
$semseed(t)$. Finding the most likely fake trace is equivalent
to finding the shortest path in an edge-weighted directed
graph where each location at time instant $t$ is linked to all
locations at the subsequent time in the semantic seed trace.

By using this decoding technique, we make sure that the
sampled trace is consistent with the generic mobility and
has a significant probability of (geographically) being a real
trace. However, Viterbi produces (only) one trace, hence we
cannot directly generate multiple fake traces. To address
this issue, we add randomness to the trace reconstruction of
Viterbi. The original Viterbi algorithm selects, at each step
(time instant), the most probable location in the path; we
add some randomness to the probabilities such that the
algorithm does not deterministically select the most probable
location. More precisely, we slightly perturb the probabilities
in such a way that Viterbi selects randomly among a set of
locations that are close in probability to the most probable
location. We implement this idea by choosing a parameter
$par_v$ and multiplying all the probabilities of moving from
one location to the next with a random number between 1
and $par_v$.
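The randomized decoding can be sketched as follows. This is a
minimal Python illustration, not the LPM-based implementation used
by the authors: `randomized_viterbi` is a hypothetical name, the
aggregate model is assumed to be given as a dict `pi` of visiting
probabilities and a dict `p` of transition probabilities keyed by
`(r, r_next)`, and every candidate set in `semseed` is assumed to
be non-empty. With `par_v = 1` the perturbation vanishes and the
sketch reduces to standard Viterbi restricted to the candidate sets.

```python
import math
import random


def randomized_viterbi(semseed, pi, p, par_v=1.0, rng=None):
    """Most probable trace through the candidate sets, with each
    transition probability multiplied by a random factor in [1, par_v]."""
    rng = rng or random.Random()

    def safe_log(prob):
        return math.log(prob) if prob > 0 else -math.inf

    # Log-probability of the best path ending in each first candidate.
    score = {r: safe_log(pi.get(r, 0.0)) for r in sorted(semseed[0])}
    back = []
    for t in range(1, len(semseed)):
        new_score, pointers = {}, {}
        for r2 in sorted(semseed[t]):
            best_prev, best_val = None, -math.inf
            for r1, s in score.items():
                noisy = p.get((r1, r2), 0.0) * rng.uniform(1.0, par_v)
                val = s + safe_log(noisy)
                if val >= best_val:
                    best_prev, best_val = r1, val
            new_score[r2], pointers[r2] = best_val, best_prev
        score = new_score
        back.append(pointers)
    # Backtrack from the best final location.
    trace = [max(score, key=score.get)]
    for pointers in reversed(back):
        trace.append(pointers[trace[-1]])
    trace.reverse()
    return trace
```

Calling the function repeatedly with `par_v > 1` and different
random states yields different plausible traces for the same
semantic seed.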

4.3 Threat Model 

The threat against fake trace generating algorithms is
twofold. (1) One threat is directly related to the adversary
who wants to filter out traces that are fake and to find out
the true location of mobile users (localization attack), e.g.,
when fake traces are used in location-based services to hide
the true location of the user. For this attack, we assume the adversary
Figure 3: A sketch of generating a fake trace from a seed.
Each location is represented by an English letter in a box.
The semantic class associated with each location is
represented by a different color. The semantic seed trace is a
trace that includes the locations in the seed along with other
locations in the same cluster at each time instant. Here,
locations are clustered as {p, d, /, t, z}, {g, a, w, x}, {/, &, p}. To
generate a fake trace, we first probabilistically remove the
seed location and probabilistically merge subsequent classes.
In our example, /, z, p are removed, and w, d, 5, x are merged
into their neighboring visited clusters. We then run a
decoder to generate a probable trace given the possibility of
choosing from all available locations at each time instant.
The fake trace is shown with dashed boxes. A rejection test
is then run on this trace to guarantee its privacy compliance.


has background knowledge of the general mobility models of
users in the considered area. In Section 5.6, we quantify
the success rate of an adversary in localizing users while
fake traces are used as a defense mechanism. (2) The other
threat is secondary as it is related to the algorithm that
generates the fakes, and comes from the adversary whose
goal is to identify the real individuals whose traces are used
as seeds (membership inclusion attack). In the next
subsection, we enforce a privacy property that protects against the
second threat. We guarantee plausible deniability for seed
contributors independently of the adversary's knowledge.

4.4 Privacy Tests 

In the end, we want to make sure that the generated fake 
traces do not leak information about the seed trace. We 
design tests to protect against the following threat model. 
We assume the adversary has access to traces (or mobility 
patterns) of some individuals that might overlap with the 
set of individuals who contribute to the seed dataset. The 
attacks depend on the scenario in which a fake trace is used. 
The adversary might be interested in separating fake from 
real (in LBS scenario) or finding the seed from which a fake 
trace is generated (in trace publishing scenario). To protect 
against such threats, as the last step of our process, we run 
a PrivacyTest() on each of the generated fake traces. 

— We compute the geographic similarity of each fake trace 
to the seed trace and reject the fake trace if its similarity is 
higher than a threshold $par_s$. This makes sure that the
fake trace does not statistically leak information about the
mobility of the individual behind the seed trace. We also
reject a fake trace if its intersection with the seed trace is
larger than $par_i$. This makes sure that the exact locations
visited by the individual are not present in the fake trace.
These tests provide the privacy guarantee with respect to
information leakage of visited locations.

— We also use another notion of privacy, which is more 
of relevance in the case of publishing a location dataset. 
Specifically, we want to defend against membership inclusion
attacks, in which an adversary wants to infer whether a
particular individual's data was included in the seed dataset.
Therefore, we design another privacy test that guarantees
plausible deniability for those whose trace was used as seed.
The main idea is that a fake trace should be as semantically
similar to its own seed as to some other real traces which are
not included in the seed dataset, so that an adversary cannot
infer with certainty that a particular individual was in the
seed dataset and de-anonymize the seed contributor. Intuitively,
if the semantic similarity of a fake trace to its own seed is
comparable to its semantic similarity with some non-seed
real trace, then the fake trace could have been generated
from either of them. To enforce this property, we test if, for
a generated fake trace $f$ (that was generated from a seed
trace $s$), there is some other real trace $s'$ such that their
differential semantic similarity is bounded:

$$|\mathrm{sim}_S(s, f) - \mathrm{sim}_S(s', f)| < par_d$$

We can enforce this to hold for a minimum of $k-1$
alternative traces $s'$ (from which we are not going to publish
fake traces). Thus, there is at least one real trace outside
the set of seeds associated with fake traces that could have
produced each releasable fake trace. More generally, we
enforce the size of the anonymity set to be $k$. This property,
which provides plausible deniability, is conceptually related
to, but weaker than, Differential Privacy (DP). Indeed, this is
one kind of guarantee that differential privacy is meant to
provide. That said, we enforce this property for actual
seeds in the datasets. Thus, by looking at a fake trace the
adversary cannot conclude with certainty that a particular
trace was in the seed dataset, because there exist at least
$k-1$ other traces that could have seeded the same fake trace.
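The three rejection tests can be sketched together as follows.
This Python sketch is illustrative and rests on assumptions: the
name `privacy_test` is hypothetical, the similarity metrics are
passed in as caller-supplied functions `sim_geo` and `sim_sem`
rather than implemented here, the intersection is assumed to be
the number of distinct locations shared by the two traces, and a
trace is rejected only when a value is strictly larger than its
threshold, following the wording of the text.

```python
def privacy_test(fake, seed, alternatives, sim_geo, sim_sem,
                 par_s, par_i, par_d, k):
    """Return True if the fake trace passes all three privacy tests.

    alternatives: real traces that are not used to publish fakes.
    """
    # Intersection test (assumed: count of shared distinct locations).
    if len(set(fake) & set(seed)) > par_i:
        return False
    # Geographic similarity test against the seed.
    if sim_geo(fake, seed) > par_s:
        return False
    # Plausible deniability: at least k-1 alternative traces must be
    # about as semantically similar to the fake as its own seed is.
    own = sim_sem(seed, fake)
    close = sum(1 for alt in alternatives
                if abs(own - sim_sem(alt, fake)) < par_d)
    return close >= k - 1
```

With the parameter values used later in the evaluation, a call
would pass `par_s=0.1`, `par_i=0`, and `par_d=0.1`.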

Each time we generate a new fake trace that passes the 
privacy tests, we compute its likelihood based on the aggre¬ 
gate mobility model. One can then randomly sample from 
the bag of fake traces based on this likelihood. The traces 
that are generated according to this process do not leak in¬ 
formation about the seed traces, yet they share their average 
geographic features and semantic features. 

4.5 Discussion 

The fake generation process, which results in a pool of fake
traces having passed the privacy test, is run offline on
powerful machines, before the users' devices retrieve and use such
fakes. Therefore, this computational burden is not placed
on the users' devices. Nevertheless, the generation process is
reasonably efficient: the computation of both the aggregate
mobility and the semantic clustering needs to be done only
once for each input set of real traces. The former takes time
$O(SL + (RT)^2)$, where $S = |\mathcal{S}|$ is the number of seed traces and $L$
is the length (i.e., number of events) of each seed trace. The
DP can be thought of as providing plausible deniability by thinking
of the differing elements between the two neighboring datasets as two 
possible inputs of the same user. When DP is satisfied, the output 
distribution is approximately the same, so that that user can plausibly 
pretend having used any one of the two inputs. 



(a) Visited locations. The size of locations is proportional to their total visits.
(b) Visited locations colored according to their semantic clustering (20 clusters).


Figure 4: 400 locations visited around Lausanne and nearby 
towns by the 30 users. Some users commute between two 
towns whereas the majority of them live and work in the 
same city of Lausanne (the area with higher concentration). 


latter is dominated by $S(S-1)$ semantic similarity
computations (e.g., each taking $O(TR^3)$ in the zeroth-order case)
and one clustering operation. Excluding the final clustering, 
this step is embarrassingly parallel: the semantic similarity 
for any two users u, v can be computed independently. Also, 
if a few input traces are added, both the aggregate statis¬ 
tics and the semantic clustering can be updated and do not 
need to be recomputed from scratch. Once the semantic 
clustering has been computed, an arbitrarily large number 
of fakes for each seed can be generated. This process is also 
embarrassingly parallel, since each fake can be generated 
independently of other fakes for that seed, and other seeds. 

The algorithm of Figure 2 works if the input dataset
contains at least two seed traces. However, its quality will be
high if, among the seed traces: the coverage of the loca¬ 
tion set R is high (not necessarily complete); the semantic 
similarity is high; and the geographic similarity is low. 

5. EVALUATION 

In this section, we run our algorithms on a set of real lo¬ 
cation traces and evaluate their utility and privacy in two 
scenarios: publishing a location dataset, and sharing loca¬ 
tions with a location-based service. 

5.1 Dataset 

The dataset we use for the evaluation was collected through
the Nokia Lausanne Data Collection Campaign (see [14]).
We prepare the dataset for our needs in two phases: filling
gaps in the traces and discretizing the time and location.

The raw dataset contains events of three types: GPS (the 
GPS position of the user is known), WLAN (the SSID and 
signal strength of a set of wireless networks which surround the
user are known), and GSM (the identifier of the GSM base 
station to which the user’s phone is associated is known). 
In the first phase, we compute valid traces (out of possibly 
partial traces) by aggregating events and filling gaps. We 
do this by interpolating along the path of consecutive GPS 
points and using the WLAN and GSM information. 

In the second phase, we extract two days of traces for 
30 users, such that each trace contains a sequence of 72 
locations (i.e., one location is reported every 20 minutes). 
Some locations are visited very rarely only by very few users. 

http://research.nokia.com/page/11367











Figure 5: Normalized histogram of the (a) geographic
similarity and (b) semantic similarity of all distinct pairs of 30
users. (a) The mobility models of different individuals are
geographically very specific to themselves, i.e., they are unique.
This is well reflected in the skewed distribution of geographic
similarity towards very small values. (b) As hypothesized in
this paper, the majority of individuals have high semantic
similarities with respect to their mobility models.


Thus, we reduce the number of locations from 1491 to 400 by
clustering close-by locations together. We use a hierarchical
clustering algorithm for this purpose, in which the distance
between two locations is taken to be proportional to both the
Euclidean distance between the locations and the product of
their weights (defined as the number of total visits to each
location, for all users). This means that locations clustered
together will tend to be both geographically close and have few
visits. The geographical distribution of visits of all users over
the locations in the considered area is shown in Figure 4(a).

We computed the mobility profiles of all 30 users, and 
then the semantic location graph by calculating a similarity 
score for each pair of locations, averaged across all users. 
After clustering this semantic location graph, we obtained 
20 location clusters. We choose this number of clusters as
it provides the optimal clustering, i.e., it maximizes the ratio
of inter-cluster similarity over intra-cluster similarity. This
clustering is illustrated in Figure 4(b), where each location
is drawn with the color of the cluster it belongs to. The
figure allows us to distinguish some patterns, for example 
locations at the center of cities are mostly in blue, while 
many locations representing roads and highways are colored 
in red. Also notice that the semantic clustering does not 
seem to depend on the geographical distance of locations. 

To illustrate the difference between geographic and
semantic similarities, we compute those metrics pairwise over
all 30 users. The result is shown in Figures 5(a) and 5(b).
The first histogram shows that the 30 users are not strongly
geographically similar to each other, except for a few pairs
of users. This is expected given the range of locations they
explore overall, as seen in Figure 4(a). On the other hand,
the distribution of the semantic similarity across all distinct
pairs of users has a larger variance; while some pairs of users
are not similar at all (e.g., those with semantic similarity
score of 0.2), a large number of users are highly similar.

5.2 Fake Trace Generator Tool 

We build our tool to generate fake traces on top of the 
open-source Location Privacy Meter (LPM). To exploit
LPM’s modularity we split our algorithm into modules. To 
Since both the geographic similarity and the semantic similarity of
a user with herself are 1.0, we exclude such pairs.


implement the time-dependent sub-sampling of clusters and 
merging around transitions, and the transformation of users’ 
actual traces into semantic traces, we use the location ob¬ 
fuscation mechanism feature of the tool. The reconstruction 
of geographically valid synthetic traces from the semantic 
traces is done using the Viterbi algorithm (implemented in 
LPM as a tracking attack). To cluster the semantic location 
graph, we employ the CLUTO toolkit [m] . 

5.3 Simulation Setup 

As for the parameters of the GenerateFake() algorithm, we
set the location-removal probability $par_c$ to 0.25, and we set
the location merging probability $par_m$ to 0.75. We set the
probability $par_l$ of removing the true location visited in the
seed to 1.0. We set the randomization multiplication
factor $par_v$ for Viterbi randomization to 4. So, each probability
assigned to each location at each time instant is multiplied
by a randomly chosen number between 1 and 4. We set
very tight values for the privacy parameters. We set $par_i$,
the maximum intersection between fake and seed, to 0. So,
we do not tolerate any intersection between fake and seed.
We set the geographic similarity threshold $par_s$ to 0.1, and
the differential semantic similarity threshold $par_d$ also to 0.1.

For each of the 30 users, we generated about 500 fake 
traces. Out of those we randomly pick 50 traces (for each 
user) to be used for the dataset publishing scenario. For the
LBS scenario, we sampled traces (for each user) according
to the traces' likelihoods, out of the pool of traces (for that
user) which passed the privacy test.

Out of the two days of traces (each 72 timestamps, for each 
of the 30 users), we use the first day as the training dataset, 
and the second day as the testing dataset. We calculate 
the aggregate statistics and mobility profile of users on the
training dataset, while we use the testing dataset to evaluate 
both the data publishing scenario and the attack for the LBS 
scenario. Unless otherwise stated, for all experiments, we 
consider a single time-period, and compute the zeroth order 
versions of the geographic and semantic similarities. 

5.4 Evaluation Metrics 

In the following two subsections we evaluate the use of 
fake traces in two popular scenarios: publishing a dataset of 
fake traces, and using fake locations along with real locations 
when accessing location-based services. In both scenarios, 
we evaluate our fake traces with respect to two metrics: pri¬ 
vacy and utility. The privacy guarantee that we provide 
using our privacy rejection test applies to both scenarios. 
However, there are some differences in terms of the adver¬ 
sary model between different scenarios. There are therefore 
some additional considerations regarding the privacy of users 
in location-based services, e.g., their privacy against infer¬ 
ence attacks, that we discuss in its corresponding subsection. 
The utility metric is very dependent on the application (sce¬ 
nario), hence is measured differently in each case. 

5.5 Scenario: Publishing Location Datasets 

5.5.1 Setup 

In this scenario, we generate fake traces for some seed
traces and publish them all in a dataset. We keep some
real traces in our dataset to use as alternative seeds in the
differential semantic similarity test.
Figure 6: Histogram of the differential semantic similarity
between fake and real traces. It presents the distribution of
the absolute difference $|\mathrm{sim}_S(s, f) - \mathrm{sim}_S(s', f)|$, for
all pairs of $f$ (plus its seed $s$) and $s'$.


Figure 7: Normalized histogram of 
the semantic similarity of all dis¬ 
tinct pairs of: each of the 30 real 
traces, along with their associated 
fake traces. 


Figure 8: Q-Q plot for comparing 
two distributions: semantic similarity 
among all real seed traces, and seman¬ 
tic similarity among all fake traces. 
The plot shows a very strong corre¬ 
lation between two distributions. 


5.5.2 Privacy

We assume an adversary wants to infer information about 
the true traces that have been used to generate the fakes. 
Due to the privacy test, the location privacy of individuals who
contributed to the seed dataset is guaranteed, as fake traces
do not intersect with the locations that were visited and do
not even leak statistical information about what could have
been visited by those individuals. Moreover, due to the differential
semantic similarity test, the seed traces are not the only 
traces that could have generated the released fake traces. 

Out of all fake traces that we generated from our 30
seed traces, on average 80% passed the geographic and
intersection privacy tests with tight constraints
($par_i = 0$, i.e., no intersection allowed, and $par_s = 0.1$),
so it is not difficult to generate fake traces that satisfy such
privacy guarantees. Regarding the differential semantic
similarity test, we should be able to find enough real
traces as alternative traces that could have been the seed
for releasing fake traces. In Figure 6, we show the difference
between the semantic similarity of a fake trace and its seed and
the semantic similarity of the same fake trace and any other
real trace in our dataset. The histogram shows that, for the
majority of fake traces, this difference is very low for real
traces other than their seeds. This is due to the high semantic
similarity between real traces (Figure 5(b)), so it is not difficult
to find potential alternative seeds for a fake trace. We set $par_d$
to 0.1 to obtain a high level of differential semantic privacy.

5.5.3 Utility 

To preserve the utility of the original traces, fake traces 
should share similar statistical properties with the real trace 
dataset. Note, however, that we would not expect all useful 
statistics to be preserved since some may be counter to our 
goal of preserving privacy. That is, certain geographic fea¬ 
tures are expected not to be preserved, due to the nature of 
the generation and the privacy test. For example, if a loca¬ 
tion is primarily visited by a single user in the real dataset, 
it is unlikely that the location would be visited with similar 
frequency by a (fake) user in the fake trace dataset. This is 
because if such a fake trace were generated from that seed, 
the privacy test would reject it. Nevertheless, we can evalu¬ 
ate to what extent certain useful statistics are preserved. 

To start, we compare the basic mobility statistics obtained
from the real and fake datasets. We compute the aggregate
mobility model for each fake dataset (we generate 10 of
size 30), and compare its geographic similarity with the real
dataset. More precisely, for a fake dataset $\mathcal{F}$, we compute
$(p_{\mathcal{F}}, \pi_{\mathcal{F}})$ and compute its similarity to $(p, \pi)$. The statistical
similarity of $p_{\mathcal{F}}$ with $p$ over all fake datasets is

[0.8061, 0.8073, 0.0060]

on (average, median, standard deviation), and the result
for the statistical similarity of $\pi_{\mathcal{F}}$ with $\pi$ is

[0.7856, 0.7867, 0.0152].

Both these results show a strong correlation between
average/aggregate mobility information of real and fake datasets.

We then compare the location visiting probabilities of the
real dataset and fake datasets. Namely, for each dataset we
compute the spatial allocation, i.e., for each location (from
least to most popular, for that dataset), we calculate the
number of visits to that location across all traces in the
dataset. We then normalize this quantity to obtain a
probability distribution over locations (sorted by popularity), i.e.,
for each location we have the probability of a random visit
to that location. From these distributions, we compute the
KL-divergence of the real (training) dataset to each of our
fake datasets, and to a variety of baselines. The results
are shown in Table 2. Since the KL-divergence is not
upper bounded, we use as baselines the KL-divergences of the
real (training) dataset to the following distributions: real
testing dataset; uniform visiting probabilities; and single
location visiting. We see that while the KL-divergence of the
real (training) dataset to the real testing dataset is smaller
than that to the fake datasets, the latter is also significantly
smaller than both the KL-divergence to the uniform
visiting baseline and that to the single location visiting baseline.
This indicates that a lot of information is preserved in the
fake datasets. Next, we repeat the previous calculation of
KL-divergence, but considering only visits to the 50 most
popular locations (of each dataset). The corresponding table
shows the results: the information is almost as well preserved
in the fake datasets as in the real testing dataset.

Because the KL-divergence is only defined at points where the second
distribution is not zero, unless the first is also zero, we set all zero
probabilities to $\epsilon = 0.1$ before normalizing. This is required because
there may be locations which are visited in the fake dataset but not
in the real dataset, or vice versa.
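The comparison of visiting distributions can be sketched as
follows. This is an illustrative Python sketch under assumptions:
the names `popularity_distribution` and `kl_divergence` are
hypothetical, the smoothing constant is applied to zero visit
counts before normalization, both inputs are assumed to cover the
same number of locations, and the natural logarithm is used since
the text does not state the base.

```python
import math


def popularity_distribution(visit_counts, eps=0.1):
    """Visiting probabilities sorted from least to most popular.

    Zero entries are replaced by eps before normalizing.
    """
    ranked = sorted(visit_counts)
    smoothed = [c if c > 0 else eps for c in ranked]
    total = sum(smoothed)
    return [c / total for c in smoothed]


def kl_divergence(p, q):
    """KL-divergence D(p || q) of two distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

A typical use would be
`kl_divergence(popularity_distribution(real), popularity_distribution(fake))`,
with the same call repeated for each baseline distribution.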

We also compare the users' time allocation of the real and
fake datasets. Namely, for each dataset and each user, we
calculate the time spent at each location, among the
locations visited. That is, we calculate, for the three most
popular locations of that user, what proportion of the time is
spent in each. We perform this calculation across all 30 users
and normalize the result. We compare this distribution for
the real and fake datasets. The corresponding table shows the
KL-divergence of the real (training) dataset to the fake datasets
and baselines: real testing dataset; uniform time allocation
(each user spends a $1/k$ proportion of time at each of the $k$
locations); random time allocation (each user spends a uniformly
random proportion of time at each location). This statistic is
highly preserved in the fakes; sometimes the fake datasets'
distribution is closer to that of the real (training) dataset
than the distribution of the real testing dataset is.

The previous results provide confidence that useful
information is indeed preserved in the fake traces dataset. That
said, our original goal was to preserve utility in the sense
of semantic similarity, so it is sensible to wonder how close we
are to that goal. To determine this, we first compute the
semantic similarity of each fake trace with its own seed trace
to check if the semantic features of the original traces are
indeed preserved. Figure 7 illustrates the distribution of this
value over all fake traces. Clearly, the distribution is biased
towards higher similarity values. So, the fake traces
largely preserve the semantic features of the real traces.

Another type of statistic that we would expect the set
of fake traces to preserve is the inner similarity between
the traces in the set. In Figure 8, we present the correlation
between two distributions: semantic similarity among real
traces, and semantic similarity among fake traces. The Q-Q
plot shows a significant correlation between these two
distributions; they are strongly linearly related. This reflects that,
in addition to maintaining the information about each seed,
we also preserve the statistical relation among the traces.

Results show that a fake trace cannot be distinguished from
the real ones if it appears among some real traces. This is
because the relation between a fake trace and the
distribution of real traces is largely indistinguishable from that of a
real trace with respect to both semantic and geographic features.


5.6 Scenario: Using Location-based Services 

The utility and privacy evaluation for the publishing dataset
scenario applies to the case where traces are shared with a
service provider. However, we can perform a more specific
analysis on the fake locations when they are shared in a new
setting. We present the details of how fake locations are used
to protect location privacy, and how they perform against
inference attacks despite the fact that they have passed privacy
guarantee tests.


Testing    Fakes (Mean)    Fakes (Std)    Uniform    Single
0.0377     0.3841          0.0432         1.1918     4.6666

Table 2: KL-divergence of the location visiting probabilities
of the real (training) datasets against the 10 fake datasets,
and various baselines. “Testing” is the testing portion of the
real dataset (see Section 5.3); “Uniform” is the uniform distribution
over all locations; and “Single” is the distribution
where all users always visit the same location.

5.6.1 Setup 

We assume a user shares her current location with a location-based
service with probability β (set to 0.5 in our case).
The service provider, in return, provides the user with con¬ 
textual information about the shared locations (e.g., list of 
nearby restaurants, current traffic information on the road). 
The service provider would receive a sequence of locations 
that are visited by the user at different time instants. To 
protect her location privacy, i.e., hiding her location at the 
time of access to the LBS and also preventing the inference of 
the full trajectory, the user sends a number of fake locations 
along with her true location. We assume the user has access 
to some full fake traces, and at any time instant t, when she 
is accessing the server, she consistently adds the locations 
visited at t on each fake trace to her actual location at t and 
sends them to the server. The service provider responds to 
each of these location queries, and the user needs to filter 
out the results associated with fake locations to obtain the 
information about her true location. 
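
The client-side procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the evaluation: traces are modeled as lists indexed by time instant, the names `build_queries` and `keep_true_response` are hypothetical, and shuffling the order of the sent locations is an added assumption so that position does not reveal the true location.

```python
import random


def build_queries(true_trace, fake_traces, beta, rng):
    """Map each exposed time instant t to the list of locations sent to the LBS.

    At each instant the user connects with probability beta and sends her
    true location together with the location at t on every fake trace.
    """
    queries = {}
    for t, true_loc in enumerate(true_trace):
        if rng.random() < beta:
            sent = [true_loc] + [fake[t] for fake in fake_traces]
            rng.shuffle(sent)  # assumption: order must not reveal the true one
            queries[t] = sent
    return queries


def keep_true_response(responses, true_loc):
    """Client-side filtering: keep only the answer for the true location."""
    return responses[true_loc]


rng = random.Random(1)
true_trace = [4, 4, 7, 2]                    # hypothetical location ids
fake_traces = [[1, 3, 3, 5], [9, 9, 7, 0]]   # two full fake traces
always = build_queries(true_trace, fake_traces, beta=1.0, rng=rng)
never = build_queries(true_trace, fake_traces, beta=0.0, rng=rng)
```

Because the same fake traces are used at every access, the fake locations sent over time form consistent trajectories instead of independent points.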

We evaluate a few other methods to generate fake loca¬ 
tions along with our method for comparison. 

• Uniform IID: We sample each fake location indepen¬ 
dently and identically distributed from the uniform 
probability distribution. 

• Aggregate Mobility IID: We sample each fake location
independently and identically distributed from the aggregate
mobility probability distribution π.

• Random Walk on Aggregate Mobility: We sample a
fake trace by doing a random walk on the set of locations
following the probability distribution p.

• Random Walk on User’s Mobility: We do a random
walk on the set of locations following the probability
distribution p(u) to generate a fake trace.
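
The four baseline generators listed above can be sketched with two helpers. The sketch assumes that locations are integer indices, that π is given as a list of probabilities and p as a row-stochastic matrix, and that each random walk starts from a location drawn from an initial distribution; the last point is an assumption, since the starting rule is not stated here.

```python
import random


def sample_iid(distribution, length, rng):
    """Draw each location independently from the given distribution."""
    locations = list(range(len(distribution)))
    return rng.choices(locations, weights=distribution, k=length)


def random_walk(transition, initial, length, rng):
    """Random walk: start from `initial`, then follow the transition rows."""
    locations = list(range(len(initial)))
    current = rng.choices(locations, weights=initial, k=1)[0]
    trace = [current]
    for _ in range(length - 1):
        current = rng.choices(locations, weights=transition[current], k=1)[0]
        trace.append(current)
    return trace


rng = random.Random(2)
n = 4
uniform_iid = sample_iid([1.0 / n] * n, 10, rng)            # Uniform IID
aggregate_iid = sample_iid([0.4, 0.3, 0.2, 0.1], 10, rng)   # hypothetical pi
# Deterministic two-location chain, used only to check the walk logic.
walk = random_walk([[0.0, 1.0], [1.0, 0.0]], [1.0, 0.0], 4, rng)
```

Passing the aggregate pair (p, π) to `random_walk` gives the Random Walk on Aggregate Mobility baseline, and passing a single user's transition matrix gives the Random Walk on User's Mobility baseline.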

5.6.2 Privacy 

We assume the adversary wants to filter out the fake lo¬ 
cations and to find the true sequence of locations that are 


Testing    Fakes (Mean)    Fakes (Std)    Uniform    Single
0.0215     0.0289          0.0086         0.2040     5.1131

Table 3: KL-divergence of the 50 most popular location visiting
probabilities of the real (training) datasets against the
10 fake datasets, and various baselines (see Table 2).
Testing    Fakes (Mean)    Fakes (Std)    Uniform    Random
0.0189     0.0125          0.0022         0.1652     0.6794
0.0026     0.0092          0.0031         0.0778     0.5360
0.0114     0.0089          0.0036         0.0779     0.5092

Table 4: KL-divergence of the users’ time allocation distribution
among the three most popular locations (of each
user) of the real (training) datasets against the 10 fake
datasets, and various baselines.
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Figure 9: Location privacy versus utility loss for different 
fake generating algorithms. The privacy is measured as 
probability of error of adversary in guessing the correct lo¬ 
cation of users. We plot the median location privacy across 
all 30 users. The probability of connecting to the LBS is set
to 0.5, so users connect to the server at one half of the time
instants. We evaluate the use of 1, 5, and 10 fake traces, hence
three dots for each algorithm. (We repeated the experiment
20 times and took the average: 4 times with a different selection
of fake traces, and for each such selection, 5 times
to eliminate the randomness of exposures.) The utility loss
is the number of distinct locations that are sent to the server
(9a), and the number of (distinct) semantic classes (i.e., clusters)
exposed for each event (9b).


gorithms, but the average number of distinct locations sent 
to the LBS is not the same. This is because of the high
randomness in Uniform IID, Agg Mobility IID, and RW
Agg Mobility, which select fake traces from all possible locations.
Our method and the RW User Profile method both have
lower diversity overhead and lower semantic overhead
in the set of fake locations. Our method clearly outperforms
all the tested methods, especially the random strategies.
For the RW User Profile method, the privacy
level against the tracking attack gets closer to what we achieve
(which is very close to the maximum), because
the fake traces generated by RW User Profile are geographically
very similar to the true locations of the user, and hence
create high confusion, and therefore error, for the adversary. Note,
however, that RW User Profile is never a privacy-preserving fake
injection method, as the adversary can easily de-anonymize
and profile the user, regardless of whether he makes mistakes in
tracking the user exactly at each access time (as shown here).

In contrast, our method is guaranteed to have minimal geographic
mutual information with the true trace, so it is
robust against profiling attacks. We also ensure that the
fake traces have small differential semantic similarity, so
they are robust against de-anonymization. Additionally, the
plot shows that our method is the strongest fake generating
algorithm against an attacker who is interested in filtering
out the fake locations and localizing the user over time.


visited by the user. The privacy metric that we use is based
on the error of the adversary in his inference attack [^. Put
simply, the fraction of true locations that are missed by the
adversary is our privacy metric. More precisely, the metric
is the probability of error of the inference attack in guessing
the correct location. We assume the adversary makes use
of the aggregate mobility model (p, π) to single out the true
locations and reconstruct the true location of the user.
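
The metric itself is simple to state in code. The sketch below pairs it with a deliberately simplified adversary that picks, at each exposed instant, the candidate with the highest aggregate probability π; this adversary ignores the transition probabilities p and is only a stand-in (an assumption) for the inference attack used in the evaluation. All names and values are hypothetical.

```python
def guess_by_popularity(candidates, pi):
    """Simplified adversary: pick the candidate with the largest pi value."""
    return max(candidates, key=lambda loc: pi[loc])


def location_privacy(true_locations, guessed_locations):
    """Fraction of exposed instants at which the adversary's guess is wrong."""
    if len(true_locations) != len(guessed_locations):
        raise ValueError("one guess is required per exposed instant")
    if not true_locations:
        return 0.0
    errors = sum(1 for truth, guess in zip(true_locations, guessed_locations)
                 if truth != guess)
    return errors / len(true_locations)


pi = [0.50, 0.30, 0.15, 0.05]               # hypothetical aggregate mobility
true_locations = [0, 1, 2, 1]               # true location at each access
queries = [[0, 3], [1, 0], [2, 0], [1, 3]]  # true location plus one fake
guesses = [guess_by_popularity(sent, pi) for sent in queries]
privacy = location_privacy(true_locations, guesses)
```

In this toy run the adversary is wrong at two of the four accesses, so the privacy level is 0.5.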

5.6.3 Utility 

There is an overhead to these privacy-preserving mechanisms,
as a user has to send more than one query to the
server to get the results for one query. This can be interpreted
as utility loss, and so we define two metrics for (the
lack of) utility. The first is the number of distinct locations
sent by the user at each time (diversity overhead). Note
that this number can be less than the number of fake traces,
as they might intersect at the times of connection to the
server. Additionally, some service providers (e.g., Google
Now) might profile the user over time based on the type of
locations she visits, in order to provide recommendations or
reminders. In these cases, the queries that are sent to the
server can pollute the profile of the user and hence reduce the
predictive power of the service provider. For this, we
use the number of (distinct) semantic clusters among the
locations sent by the user at each time (semantic overhead).
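
Both overhead metrics reduce to counting distinct items per query, as in the following sketch. It assumes that each query is a list of location identifiers and that a mapping from location to semantic cluster is available from the clustering step; the names `cluster_of` and `average_overhead` and the example values are hypothetical.

```python
def diversity_overhead(sent_locations):
    """Number of distinct locations sent to the server at one access."""
    return len(set(sent_locations))


def semantic_overhead(sent_locations, cluster_of):
    """Number of distinct semantic clusters among the locations sent."""
    return len({cluster_of[loc] for loc in sent_locations})


def average_overhead(queries, overhead):
    """Mean of an overhead function over all accesses to the server."""
    values = [overhead(sent) for sent in queries]
    return sum(values) / len(values)


cluster_of = {3: "home", 7: "work", 9: "home"}  # hypothetical clustering
sent = [3, 3, 7, 9]  # true location 3; one fake trace coincides with it
diversity = diversity_overhead(sent)
semantic = semantic_overhead(sent, cluster_of)
mean_diversity = average_overhead([sent, [3, 7]], diversity_overhead)
```

The example also shows why the diversity overhead can be smaller than the number of traces: two of the four entries coincide, so only three distinct locations are sent.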


5.6.4 Results 

Figure 9 shows the tradeoff between location privacy and
utility for various methods of generating fake traces. We
evaluate the utility loss in terms of two metrics: diversity
overhead (Figure 9a) and semantic overhead (Figure 9b).
We evaluate the privacy for exposure probability β = 0.5,
and three different numbers of fake traces: 1, 5, 10. Although
the number of fake traces are the same, across different al- 


6. RELATED WORK 

Location obfuscation is a prevalent non-cryptographic technique
to protect location privacy. It does not require changing
the infrastructure, as it can be done entirely on the user’s
side, either by altering (perturbing) the location coordinates
to be reported or by sending fake location reports interleaved
with, or along with, the true location of the user.

Many location perturbation techniques have been proposed
in the literature, usually based on adding some noise
to the user’s location coordinates or reducing their granularity,
e.g., [^[^[^[^. The downside of these techniques is
that they reduce the service quality that the user obtains from
the location-based service (LBS) provider. This is because
the server provides contextual information related to
the shared location and not the true location of the user. So,
users have to trade service quality to obtain their required
level of privacy. Optimal solutions for location perturbation
have been proposed [4, 28], which show the high cost of
location privacy on service quality using perturbation.

Hiding the user’s true location among fake locations is a
promising yet little-explored approach to protecting location
privacy. A few simple techniques have been proposed
so far: adding independently selected fake locations drawn
from the population’s location distribution [^, generating
dummy locations at random as a random walk on a grid
[13], constructing fake driving trips by building the path
between two random locations on the map given the more
probable paths traveled by drivers [^, or adding noise to
the paths generated by road trip planner algorithms [^.
These solutions lack a formal model for human mobility and
do not consider the semantics associated with the sequence of
locations visited by people over time. Thus, the generated
traces can be distinguished from real location traces.

This paper, to the best of our knowledge, is the first that
proposes a systematic methodology for generating fake location
traces based on statistical features of both geographical
and semantic dimensions of real traces, and based on a metric
to measure how realistic a synthetic trace is. Moreover,
we introduce multiple privacy tests to ensure that the
published/shared fake traces themselves do not leak information
about real seed traces. Our evaluation on real data also
shows the clear advantage of our algorithm with respect to
other existing approaches against known inference attacks.

7. CONCLUSIONS 

This is the first paper to address the problem of generating
realistic synthetic location traces based on quantitative
metrics. Generating such traces is very useful for protecting the
privacy of users when they share locations with location-based
services, or when we want to publish a dataset of locations
to be used for research purposes. Based on well-established
statistical methods, we propose two metrics to
quantify geographic and semantic features of human mobility.
Using these metrics, we propose efficient algorithms to
generate fake traces that do not leak geographic information
about real individuals (guaranteed using a privacy rejection
test), yet closely resemble the mobility of a population
semantically. We show that inference attacks cannot identify
the true location of mobile users if our fake traces are used
as protection. We also quantitatively show that our method
is superior to all existing methods of generating fake traces.
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